Exploratory Data Analysis in R

Phil Chodrow, MIT

Tuesday, August 28th, 2018

Introduction

Data Science Is:

  • Gathering data that matters.
  • Asking questions that matter about your data.
  • Choosing appropriate methods to answer those questions.
  • Implementing solutions that meet stakeholder needs.

My Favorite Picture

My Second Favorite Picture

My Third Favorite Picture

(Image credit: Hadley Wickham)

My Third Favorite Picture:

  • Exploratory data analysis (EDA) covers roughly the “transform” and “visualize” steps.

EDA is…

…getting to know your Data

What We’ll Do This Morning

  • Ask an impactful question of a real data set.
  • Use EDA to propose candidate answers.
  • Create a simple business intelligence (BI) dashboard to guide decision-makers.

The Case

FYI: Base R and the Tidyverse

  • You can do EDA in “base” R without any packages.
  • But base R is…not a good programming language.
  • We will use the Tidyverse, a set of packages that promote code that is easy to write and read, highly performant, and consistent throughout the data science pipeline.
  • The Tidyverse has been extensively developed by Hadley Wickham and collaborators over the last decade.

If you have prior experience in R and did not begin all your scripts with library(dplyr)….

Getting Started

Case Study

  1. Inspect the data
  2. filter(), arrange(), and select()

Pipes for your Data

  • x %>% f() \(\Longleftrightarrow\) f(x)
  • “Take x, and then do f to it”
  • x %>% f(y) \(\Longleftrightarrow\) f(x,y)
  • x %>% f(y) %>% g(z) \(\Longleftrightarrow\) g(f(x,y),z)
  • “Take x, then do f with option y, then do g with option z…”
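
The equivalences above can be checked directly. This is a minimal sketch, assuming dplyr is installed (it re-exports the %>% pipe from magrittr); the vector x and the choice of sqrt(), head(), and round() are illustrative only.

```r
library(dplyr)  # re-exports the %>% pipe from magrittr

x <- c(1, 4, 9, 16)

# x %>% f()  is the same as  f(x)
identical(x %>% sqrt(), sqrt(x))

# x %>% f(y)  is the same as  f(x, y)
identical(x %>% head(2), head(x, 2))

# x %>% f(y) %>% g(z)  is the same as  g(f(x, y), z)
piped  <- x %>% head(3) %>% round(1)
nested <- round(head(x, 3), 1)
identical(piped, nested)
```

Each comparison returns TRUE. The piped version reads left to right in the order the steps happen, which is the main reason to prefer it.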

Some Simple Examples
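
One possible example, sketched on a toy table rather than the real case-study data: the column names mirror the listings table, but the values and the name listings_toy are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the listings table; values are invented.
listings_toy <- tibble(
  id = c(1, 2, 3, 4),
  neighbourhood_cleansed = c("Roslindale", "Back Bay", "Roslindale", "Fenway"),
  price = c(80, 250, 65, 120),
  review_scores_rating = c(94, 88, NA, 97)
)

cheap_rated <- listings_toy %>%
  # keep rows that satisfy both conditions
  filter(price < 150, !is.na(review_scores_rating)) %>%
  # sort rows, best rating first
  arrange(desc(review_scores_rating)) %>%
  # keep only these columns
  select(id, neighbourhood_cleansed, review_scores_rating)

cheap_rated
```

Read it aloud as "take the listings, then filter, then arrange, then select".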

Let’s try this out – back to the case study!

Summarising Data

Data Summaries

  • You should usually summarise your data before turning on the fancy algorithms – sometimes the story is clear.

Basic Data Summaries

Go from this:

id        neighbourhood_cleansed  review_scores_rating
12147973  Roslindale              NA
3075044   Roslindale              94
6976      Roslindale              98
1436513   Roslindale              100
7651065   Roslindale              99
12386020  Roslindale              100

…to this:

neighbourhood_cleansed   n    mean_rating
Leather District         5    98.33333
Roslindale               56   95.38000
West Roxbury             46   95.21212
South Boston Waterfront  83   94.43103
Jamaica Plain            343  94.15932
Longwood Medical Area    9    94.00000

Summaries the Tidy Way
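
A minimal sketch of how a per-neighbourhood summary like this might be computed with group_by() and summarise(). The toy table reuses the six Roslindale rows shown earlier; the "Toyville" rows and the names listings_toy and summary_toy are made up for illustration.

```r
library(dplyr)

# Six Roslindale rows from the earlier table, plus two invented rows
# for a second neighbourhood so that grouping has something to do.
listings_toy <- tibble(
  id = c(12147973, 3075044, 6976, 1436513, 7651065, 12386020, 1, 2),
  neighbourhood_cleansed = c(rep("Roslindale", 6), rep("Toyville", 2)),
  review_scores_rating = c(NA, 94, 98, 100, 99, 100, 90, 80)
)

summary_toy <- listings_toy %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(
    n = n(),                                                # rows per group
    mean_rating = mean(review_scores_rating, na.rm = TRUE)  # ignore missing ratings
  ) %>%
  arrange(desc(mean_rating))

summary_toy
```

Without na.rm = TRUE, any group containing a missing rating would get a mean of NA.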

Keeping Current

  1. More practice with filter and summarise
  2. Joining data

How Recent is our Info?

## # A tibble: 1 x 2
##   earliest   latest    
##   <date>     <date>    
## 1 2016-09-06 2017-09-05

But some of these listings may be “zombies” without recent availability. How can we include only listings with availability from a certain time period?
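
A date range like the one printed above could come from a one-line summarise(). This is a sketch on a toy calendar table; the column names listing_id, date, and available are assumptions about the calendar data, and the rows are made up.

```r
library(dplyr)

# Toy stand-in for the calendar table; rows are invented.
calendar_toy <- tibble(
  listing_id = c(1, 1, 2, 2),
  date = as.Date(c("2016-09-06", "2017-01-15", "2016-12-01", "2017-09-05")),
  available = c(TRUE, FALSE, TRUE, TRUE)
)

# With no group_by(), summarise() collapses the whole table to one row.
date_range <- calendar_toy %>%
  summarise(earliest = min(date), latest = max(date))

date_range
```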

The Approach

  1. Operate on the calendar table (exercise)
  2. Join that information to the listings table (together)
  3. Filter the listings table accordingly (together)

Relational Data

The information we need is distributed between two tables – how can we get there?

We need a key column that tells us which calendar rows correspond to which listings.

listings$id corresponds to calendar$listing_id
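
The three-step approach might look roughly like this. It is a sketch under stated assumptions: the key column in the calendar table is taken to be listing_id, the cutoff date is arbitrary, and both toy tables are made up.

```r
library(dplyr)

# Toy stand-ins for the two tables; rows are invented.
listings_toy <- tibble(id = c(1, 2, 3), name = c("A", "B", "C"))
calendar_toy <- tibble(
  listing_id = c(1, 1, 2, 3),
  date = as.Date(c("2017-06-01", "2017-07-01", "2016-10-01", "2017-08-15")),
  available = c(TRUE, TRUE, TRUE, FALSE)
)

# Step 1: per listing, count available nights on or after a cutoff date.
recent <- calendar_toy %>%
  filter(available, date >= as.Date("2017-06-01")) %>%
  group_by(listing_id) %>%
  summarise(nights_available = n())

# Step 2: attach that count to the listings table via the key columns.
# Step 3: keep only listings with at least one recent available night.
current_listings <- listings_toy %>%
  left_join(recent, by = c("id" = "listing_id")) %>%
  filter(!is.na(nights_available))

current_listings
```

Listings with no match in recent get NA after the join, which is what the final filter() removes.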

join

The join family of functions lets us add columns from one table to another using a key.

  • x %>% left_join(y) : most common, keeps all rows of x but not necessarily y.
  • x %>% right_join(y) : keeps all rows of y but not necessarily x.
  • x %>% full_join(y) : keeps all rows of both x and y.
  • x %>% inner_join(y) : keeps only rows of x that match in y and vice versa.

We’ll use left_join for this case – let’s try it in the case study.
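
To see how the join variants differ, compare row counts on two tiny tables. This sketch uses dplyr's left_join(), right_join(), inner_join(), and full_join(); the tables x and y are made up.

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- tibble(key = c(2, 3, 4), y_val = c("B", "C", "D"))

left  <- x %>% left_join(y, by = "key")   # all rows of x: keys 1, 2, 3
right <- x %>% right_join(y, by = "key")  # all rows of y: keys 2, 3, 4
inner <- x %>% inner_join(y, by = "key")  # only matching keys: 2, 3
full  <- x %>% full_join(y, by = "key")   # every key from either table: 1 to 4

# Row counts: left = 3, right = 3, inner = 2, full = 4
sapply(list(left = left, right = right, inner = inner, full = full), nrow)
```

Unmatched rows are kept with NA in the columns that come from the other table.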

Getting Visual

  1. Graphical Excellence
  2. The Grammar of Graphics
  3. ggplot2

My Third Favorite Picture:

Graphical Excellence

Graphical excellence is the well-designed presentation of interesting data – a matter of substance, of statistics, and of design. Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency.

Edward Tufte

Graphical Excellence (edwardtufte.com)

The Grammar of Graphics

A grammar is a set of components (ingredients) that you can combine to create complex structures (sentences, recipes, data visualizations). In baking….

  • A body – typically some kind of flour
  • Binder – eggs, oil, butter, applesauce
  • A rising agent – yeast, baking soda, baking powder
  • Flavoring – sugar, salt, chocolate, vanilla
  • Get it wrong, and…

The Grammar of Graphics

  • Puts the gg in ggplot2.
  • Formulated by Leland Wilkinson.
  • Implemented in code by Hadley Wickham, now part of the tidyverse

Ingredients of a data visualization

  • Data: almost always a data frame
  • Aesthetic mapping: relation of data to chart components.
  • Geometry: specific visualization type? E.g. line, bar, heatmap?
  • Statistical transformation: how should the data be transformed or aggregated before visualizing?
  • Theme: how should the non-data parts of the plot look?
  • Misc. other options.
  • (+ plays the same role in ggplot2 that %>% does in data manipulation.)

First Plot

Does getting lots of reviews usually mean you get good reviews?
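
One way to start on this question is a scatter plot, built up from the ingredients of the grammar. This is a sketch on made-up values; the column names number_of_reviews and review_scores_rating are assumed to match the listings table.

```r
library(ggplot2)

# Toy stand-in for the listings table; values are invented.
listings_toy <- data.frame(
  number_of_reviews = c(2, 15, 40, 85, 120, 200),
  review_scores_rating = c(80, 100, 92, 95, 94, 96)
)

first_plot <-
  ggplot(listings_toy,                                # data
         aes(x = number_of_reviews,                   # aesthetic mapping
             y = review_scores_rating)) +
  geom_point() +                                      # geometry
  theme_bw()                                          # theme

# print(first_plot) renders the chart
```

Each + adds one ingredient, so the plot can be built and inspected one layer at a time.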

Changing Aesthetics

As a Heatmap

Exercise 6

The following code computes the average price of all listings on each day in the data set:
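
A minimal sketch of such a computation, assuming the calendar table has a date column and a numeric price column. The toy data and the names calendar_toy, average_price, and mean_price are illustrative only.

```r
library(dplyr)

# Toy stand-in for the calendar table; rows are invented.
calendar_toy <- tibble(
  listing_id = c(1, 2, 1, 2),
  date = as.Date(c("2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02")),
  price = c(100, 200, 110, NA)
)

# One row per day, with the mean price over all listings that have a price.
average_price <- calendar_toy %>%
  group_by(date) %>%
  summarise(mean_price = mean(price, na.rm = TRUE))

average_price
```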

Use geom_line() to visualize these prices with time on the x-axis and price on the y-axis.

Exercise 6 Sample Solution
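
One possible solution, sketched on a made-up table of daily average prices. The names average_price_toy and mean_price are assumptions, not necessarily those used in the exercise.

```r
library(ggplot2)

# Toy daily averages; values are invented.
average_price_toy <- data.frame(
  date = as.Date("2017-01-01") + 0:4,
  mean_price = c(150, 155, 149, 162, 170)
)

# Time on the x-axis, price on the y-axis, connected by a line.
price_plot <- ggplot(average_price_toy, aes(x = date, y = mean_price)) +
  geom_line()

# print(price_plot) renders the chart
```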

Exercise 7

Using the summary_table object you created earlier, make a bar chart showing the number of apartments by neighbourhood. In this case, the correct geom to use is geom_bar(stat = 'identity').

Exercise 7 Sample Solution
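
One possible solution, sketched with three rows taken from the summary table shown earlier. The name summary_toy stands in for summary_table, whose exact columns are assumed here.

```r
library(ggplot2)

# Three rows from the earlier summary table (n = number of listings).
summary_toy <- data.frame(
  neighbourhood_cleansed = c("Leather District", "Roslindale", "West Roxbury"),
  n = c(5, 56, 46)
)

# stat = "identity" tells geom_bar() to use n as the bar height
# instead of counting rows itself.
bar_plot <- ggplot(summary_toy, aes(x = neighbourhood_cleansed, y = n)) +
  geom_bar(stat = "identity")

# print(bar_plot) renders the chart
```

geom_col() is shorthand for the same thing.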

Let’s Clean This Up a Bit

Comparisons: Fill, Color, and Facets

From Exercise 7

From Our First Plot

Mini-Project

The Setup

  • You are a seasoned data scientist who has just arrived in Boston.
  • You want to go see the sights, but you don’t know where they are!
  • Instead of buying a tourist guide like a Muggle, you are going to use your skills in exploratory data analysis to find cool spots to visit.

Instructions

  • Your deliverable is dashboard.rmd, which you can find in the 1_orientation/2_data_science/code directory.
  • Personalize the dashboard with your name and your partner’s name.
  • Conduct analysis to identify three locations or neighborhoods you want to visit. Use Google to figure out what might be interesting about those spots, but your commentary should show how your analysis supports your findings.
  • At the end of the hour, you will upload dashboard.html at the provided link.
  • A subset of groups will be randomly chosen to briefly present.

Additional Resources

Map of the Tidyverse

Guides and Cheatsheets

Books and Courses

Other Topics in R

Some Cool R Tricks